A New Method Based on Context for Combining Statistical Language Models
Internal identifier: 008F28 (Main/Exploration); previous: 008F27; next: 008F29
Authors: David Langlois; Kamel Smaïli; Jean-Paul Haton [France]
English descriptors: combination; distant models; statistical language modeling
Abstract
In this paper we propose a new method to extract from a corpus the histories for which a given language model is better than another one. The decision is based on a measure derived from perplexity. For a given history, this measure makes it possible to compare two language models and then to choose the better one for that history. Using this principle, and with a 20K-word vocabulary, we combined two language models: a bigram and a distant bigram. The contribution of the distant bigram is significant: the combination outperforms the bigram model by 7.5%. Moreover, performance on the Shannon game is improved. We show that the proposed framework for combining language models is cheaper than the maximum entropy principle. In addition, the selected histories for which one model is better than the other have been collected and studied. Almost all of them are the beginnings of very frequently used French phrases. Finally, by using this principle, we achieve a better trigram model in terms of parameters and perplexity. This model is a combination of a bigram and a trigram based on selected histories.
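The per-history selection described in the abstract can be sketched roughly as follows. The abstract does not give the exact measure, so this sketch assumes it is the perplexity of each model restricted to the events that share a history, computed on held-out data. All names (`per_history_perplexity`, `select_histories`, `combine`) and the two toy probability tables are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

def per_history_perplexity(events, model):
    """Perplexity of `model` computed separately for each history.

    `events` is a list of (history, word) pairs; `model(word, history)`
    returns a strictly positive probability.
    """
    log_sum = defaultdict(float)
    count = defaultdict(int)
    for history, word in events:
        log_sum[history] += math.log(model(word, history))
        count[history] += 1
    return {h: math.exp(-log_sum[h] / count[h]) for h in log_sum}

def select_histories(events, baseline, challenger):
    """Histories for which the challenger has lower perplexity."""
    pp_base = per_history_perplexity(events, baseline)
    pp_chal = per_history_perplexity(events, challenger)
    return {h for h in pp_base if pp_chal[h] < pp_base[h]}

def combine(baseline, challenger, selected):
    """Use the challenger on selected histories, the baseline elsewhere."""
    def model(word, history):
        chosen = challenger if history in selected else baseline
        return chosen(word, history)
    return model

# Toy stand-ins (assumed values, not real estimates).
# A history is the pair (w[i-2], w[i-1]).
def bigram(word, history):
    table = {("b", "x"): 0.5, ("b", "y"): 0.5}   # keyed by (w[i-1], word)
    return table.get((history[1], word), 0.1)

def distant_bigram(word, history):
    table = {("a", "x"): 0.9}                     # keyed by (w[i-2], word)
    return table.get((history[0], word), 0.1)

events = [(("a", "b"), "x"), (("a", "b"), "x"), (("c", "b"), "y")]
selected = select_histories(events, bigram, distant_bigram)
model = combine(bigram, distant_bigram, selected)
```

In this toy setting the distant bigram is kept only for the history `("a", "b")`, where it predicts the observed words better than the bigram. Because the choice depends only on the history, each conditional distribution of the combined model is inherited whole from one of the two component models.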
Affiliations:
- France
- Grand Est, Lorraine (région)
- Nancy
- Centre national de la recherche scientifique, Institut national de recherche en informatique et en automatique, Laboratoire lorrain de recherche en informatique et ses applications, Université de Lorraine
Links to previous steps (curation, corpus...)
- to stream Crin, to step Corpus: 002F06
- to stream Crin, to step Curation: 002F06
- to stream Crin, to step Checkpoint: 001544
- to stream Main, to step Merge: 009449
- to stream Main, to step Curation: 008F28
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" wicri:score="438">A New Method Based on Context for Combining Statistical Language Models</title>
</titleStmt>
<publicationStmt><idno type="RBID">CRIN:langlois01a</idno>
<date when="2001" year="2001">2001</date>
<idno type="wicri:Area/Crin/Corpus">002F06</idno>
<idno type="wicri:Area/Crin/Curation">002F06</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Curation">002F06</idno>
<idno type="wicri:Area/Crin/Checkpoint">001544</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Checkpoint">001544</idno>
<idno type="wicri:Area/Main/Merge">009449</idno>
<idno type="wicri:Area/Main/Curation">008F28</idno>
<idno type="wicri:Area/Main/Exploration">008F28</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">A New Method Based on Context for Combining Statistical Language Models</title>
<author><name sortKey="Langlois, David" sort="Langlois, David" uniqKey="Langlois D" first="David" last="Langlois">David Langlois</name>
</author>
<author><name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaïli">Kamel Smaïli</name>
</author>
<author><name sortKey="Haton, Jean Paul" sort="Haton, Jean Paul" uniqKey="Haton J" first="Jean-Paul" last="Haton">Jean-Paul Haton</name>
<affiliation><country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>combination</term>
<term>distant models</term>
<term>statistical language modeling</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en" wicri:score="3746">In this paper we propose a new method to extract from a corpus the histories for which a given language model is better than another one. The decision is based on a measure stemmed from perplexity. This measure allows, for a given history, to compare two language models, and then to choose the best one for this history. Using this principle, and with a 20K vocabulary words, we combined two language models : a bigram and a distant bigram. The contribution of a distant bigram is significant and outperforms a bigram model by 7.5%. Moreover, the performance in Shannon game are improved. We show through this article that we proposed a cheaper framework in comparison to the maximum entropy principle, for combining language models. In addition, the selected histories for which a model is better than another one, have been collected and studied. Almost, all of them are beginnings of very frequently used French phrases. Finally, by using this principle, we achieve a better trigram model in terms of parameters and perplexity. This model is a combination of a bigram and a trigram based on a selected history.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement><li>Nancy</li>
</settlement>
<orgName><li>Centre national de la recherche scientifique</li>
<li>Institut national de recherche en informatique et en automatique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree><noCountry><name sortKey="Langlois, David" sort="Langlois, David" uniqKey="Langlois D" first="David" last="Langlois">David Langlois</name>
<name sortKey="Smaili, Kamel" sort="Smaili, Kamel" uniqKey="Smaili K" first="Kamel" last="Smaïli">Kamel Smaïli</name>
</noCountry>
<country name="France"><region name="Grand Est"><name sortKey="Haton, Jean Paul" sort="Haton, Jean Paul" uniqKey="Haton J" first="Jean-Paul" last="Haton">Jean-Paul Haton</name>
</region>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 008F28 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 008F28 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Main |étape= Exploration |type= RBID |clé= CRIN:langlois01a |texte= A New Method Based on Context for Combining Statistical Language Models }}
This area was generated with Dilib version V0.6.33.